1 Introduction

Moving to a new city on a tight budget is challenging. A metropolis like London in particular has high rents and a competitive market, which makes it difficult to find accommodation with the right attributes at the right price. Sharing economy services such as AirBnB have facilitated the search for a spare room rented out by a private host. The listed rooms and apartments are ready for the guest to settle right in. But how do you know whether the price you are paying for your flat is actually fair?

This paper aims to create a model that identifies how much AirBnB guests value different attributes of London rooms. Guests can also refer to it to check whether they are paying an appropriate price for their AirBnB.

2 Description of the dataset

The dataset used in this paper covers all AirBnB offerings in London as of 4 and 5 March 2017. It contains 53,904 observations for 95 different variables. Its source is the website “Inside AirBnB - Adding data to the debate” (Cox, 2017), an independent and non-commercial project that aims to examine the effect of AirBnB activities on urban development.

To allow this investigation to be more focused, the dataset was narrowed down. Only private rooms with at least three valid ratings were included. The resulting dataset has 6,495 observations for 78 variables.

2.1 Price

Table 1: Descriptives of the Price
Min Q1 Median Mean Q3 Max
8 35 45 50 59 590

A room in London costs on average £50 per night. The summary statistics show that 75% of all rooms are priced at £59 per night or less. However, there are some severe outliers, with prices reaching a maximum of £590.

This raises concerns about the normality of the price distribution. In fact, the density plot in Figure 1 shows that the distribution is strongly right-skewed rather than normal. To bring the distribution closer to normal, the price is transformed using the natural logarithm.
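As a minimal sketch of why the logarithm helps, the following Python snippet compares the skewness of a small set of prices before and after the transformation. The price values and the helper `skewness` are hypothetical and chosen only to mimic the shape described in Table 1; they are not taken from the dataset.

```python
import math
import statistics


def skewness(values):
    """Return the population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical nightly prices in GBP with one extreme outlier (assumption).
prices = [8, 30, 35, 40, 45, 45, 50, 55, 59, 75, 120, 590]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of price:     {skew_raw:.2f}")
print(f"skewness of ln(price): {skew_log:.2f}")
```

A skewness closer to zero after the transformation indicates a more symmetric distribution, which is the property the log transformation is meant to achieve here.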

Figure 1: Density of Price and ln(Price)


2.2 Rent

## 
##  Pearson's product-moment correlation
## 
## data:  data_short$mean_rent and data_short$price_log
## t = 46.089, df = 7018, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4638676 0.4997879
## sample estimates:
##       cor 
## 0.4820303

With London being one of the most expensive cities to live in, rent is a major cost of being a host on AirBnB. Rents are also an interesting indicator of the attractiveness of a neighbourhood. Therefore, the impact of the underlying rent on the AirBnB price has to be accounted for. The initial dataset holds no information on the regular rent at the location of an AirBnB. Fortunately, a website called “Find Properly” (Lokku Ltd., 2017) uses data from Zoopla to provide rent and selling prices for each London region, broken down by postcode. Using the postcode, the average weekly rent for 1-bed properties is merged with the AirBnB dataset. The matching is based on the outward code, that is, the first part of the postcode.
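The matching step can be sketched as follows. This is an illustration under stated assumptions rather than the code used for this paper: the rent values, the field names and the helper `outward_code` are hypothetical, and each listing is assumed to carry a complete postcode whose inward part is the final three characters.

```python
def outward_code(postcode):
    """Return the outward code: everything before the final three characters."""
    compact = postcode.replace(" ", "").upper()
    return compact[:-3]


# Hypothetical average weekly rents (GBP) for 1-bed properties per outward code.
mean_rent_by_outward = {"W1D": 550.0, "E1": 380.0, "SW11": 400.0}

# Hypothetical listings; the field names are illustrative only.
listings = [
    {"id": 1, "postcode": "W1D 7ET"},
    {"id": 2, "postcode": "e1 6an"},
    {"id": 3, "postcode": "SW11 1AA"},
    {"id": 4, "postcode": "N1 9GU"},
]

for listing in listings:
    code = outward_code(listing["postcode"])
    listing["outward"] = code
    # Listings without a matching rent entry get None instead of a rent.
    listing["mean_rent"] = mean_rent_by_outward.get(code)
    print(listing)
```

Listings that cannot be matched, such as the last one in this example, would need to be dropped or handled separately before the regression.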

Mapping the mean rent and the logarithmically transformed AirBnB price geographically reveals the positive correlation (+0.48) between the two variables. Nevertheless, it also becomes clear that there is more to an AirBnB price than just the average rent in the particular neighbourhood.
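For reference, the correlation coefficient and its test statistic can be computed as in the short Python sketch below. The function names are our own for illustration. The last lines plug in the correlation and sample size implied by the test output reported in this section (r = 0.4820303 with 7,018 degrees of freedom, hence n = 7,020).

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for H0: true correlation is zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Values taken from the reported test output.
t_value = t_statistic(0.4820303, 7020)
print(f"t = {t_value:.3f}")
```

The resulting value agrees with the reported t statistic, which shows how a moderate correlation becomes highly significant in a sample of this size.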

Figure 2: Mapping Rent Prices vs. AirBnB Prices


2.3 Location

When choosing an AirBnB in London, many guests prefer to stay close to the city centre. Distance is defined as the distance to the tourist centre of the city, Piccadilly Circus. It was calculated using the Haversine formula (Reid, 2011) and the geographic coordinates of Piccadilly Circus (longitude: -0.133869, latitude: 51.510067) (Latlong.net, 2017). The correlation between distance and the logarithmic AirBnB price is negative and weak (-0.39): the closer the property is to the city centre, the higher the price tends to be. Looking at the different distance bins, most of the high-end outliers are located close to Piccadilly Circus. The price range also narrows the further the flat is from the city centre.
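The distance calculation can be sketched in Python as follows. The coordinates of Piccadilly Circus are those given above; the Earth radius of 6,371 km and the function names are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

PICCADILLY_LAT = 51.510067
PICCADILLY_LON = -0.133869


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_centre_km(lat, lon):
    """Distance from a listing to Piccadilly Circus in kilometres."""
    return haversine_km(lat, lon, PICCADILLY_LAT, PICCADILLY_LON)


# One degree of latitude along a meridian, as a plausibility check.
one_degree_km = haversine_km(0.0, 0.0, 1.0, 0.0)
print(f"one degree of latitude: {one_degree_km:.2f} km")
```

Applying `distance_to_centre_km` to the latitude and longitude of each listing yields the distance variable used in the remainder of the paper.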

Figure 3: Mapping Rent Prices vs. AirBnB Prices


2.4 Reviews

Reviews could be a useful indicator of various characteristics of the room advertised. In addition to written reviews, guests can give their hosts star ratings in the following categories (Airbnb Inc., 2017): overall experience, accuracy, cleanliness, communication, check-in, location and value. Most of these are self-explanatory; accuracy represents the extent to which the online listing reflects reality, and value is a subjective measure of whether the room was worth the price paid.

The guest ratings are translated into a score out of 10 for the individual categories, and a score out of 100 for the overall score. The mean value for many categories is 9 or 10. Such high scores are frequently seen when feedback from users is collected. For example, Uber considers removing drivers rated on average less than 4.6 stars out of 5 (Insider, 2015).

Since the overall score is submitted independently, rather than calculated from the category scores, it is interesting to see which categories affect the user’s overall rating the most. All the subcategory ratings have at least a moderate, positive relation to the overall score. The correlations between the overall score and value, check-in and accuracy are the strongest, suggesting that those categories matter most for the guest’s overall satisfaction. In general, the rating scores are only very weakly related to price (correlations between 0.04 and 0.14), suggesting that these indicators will have little effect on the goodness of fit. Location, however, has a weak positive correlation (0.30) with the logarithmic price, making it an interesting indicator of the security, comfort and attractiveness of the neighbourhood.

Table 3: Descriptives of Rating
Name Minimum Maximum Mean Correlation to Overall Score Correlation to Price
Accuracy 2 10 9 0.77 0.09
Check In 2 10 10 0.78 0.14
Cleanliness 2 10 9 0.67 0.10
Communication 4 10 10 0.68 0.10
Location 3 10 9 0.54 0.30
Value 2 10 9 0.79 0.04
Overall 20 100 92 1.00 0.13

2.5 Property Characteristics

2.5.1 Accommodates and Beds

Table 4: Descriptives of Capacity
Variable P-Value Conf-Int. Low Estimate Conf-Int. High
Accommodates 6.679629e-187 0.32 0.34 0.36
Beds 1.653387e-71 0.19 0.21 0.23

The variables accommodates (how many people can stay in the property) and beds (the number of beds in the property) give an indication of the overall capacity of the AirBnB. Both variables have a correlation with the room price that is significantly different from zero. However, both correlations are weak, suggesting that even though price rises with capacity, it rises slowly.

2.5.2 Amenities

Figure 4: Percentage of Amenities


AirBnB includes some general information on the property, such as the room type, the number of people that can be accommodated or the number of bathrooms. On top of these characteristics, AirBnB contains information on a wide range of amenities for every flat. These range from the availability of Internet and a TV to a personal doorman or a pool. In order to analyse these, dummy variables were introduced for 53 different amenities, 46 of which resulted in usable data, together with a variable counting the total number of amenities.
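The construction of the dummy variables can be sketched as below. We assume, for illustration, that the amenities of a listing are stored as a single string of the form `{TV,"Wireless Internet",Kitchen}`; the example string, the vocabulary and the function names are hypothetical.

```python
import csv


def parse_amenities(raw):
    """Split a string such as '{TV,"Wireless Internet",Kitchen}' into a list of names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader handles names that are wrapped in double quotes.
    return [name.strip() for name in next(csv.reader([inner]))]


def amenity_dummies(raw, vocabulary):
    """Return one 0/1 indicator per amenity in `vocabulary` plus the total count."""
    present = set(parse_amenities(raw))
    row = {name: int(name in present) for name in vocabulary}
    row["amenity_count"] = len(present)
    return row


# Hypothetical vocabulary and listing (assumptions for this example).
vocabulary = ["TV", "Dryer", "Washer", "Elevator in building", "Lock on bedroom door"]
example = '{TV,"Wireless Internet",Kitchen,Washer,"Lock on bedroom door"}'
row = amenity_dummies(example, vocabulary)
print(row)
```

Each listing thus contributes one row of indicators, which can then be compared against the logarithmic price amenity by amenity.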

Seven amenities appear to be related to the price, including home essentials such as a TV, a dryer and a washer, and facilities like an elevator. Whether the room has a door lock and whether it offers a family- and kid-friendly environment also matter. The presence of a TV, an elevator, a dryer or a washer tends to have a positive impact on the flat’s price; the difference is largest for the TV and the dryer. Interestingly, AirBnBs that have a kitchen or a lock on the bedroom door seem to be slightly less valued, although the difference for the kitchen is not statistically significant (p = 0.52). Perhaps a lock on the bedroom door is more commonly in place in less safe locations.

Table 5: Descriptives of Selected Amenities
Amenities P-Value Mean ln(price) With Mean ln(price) Without Price Difference
Washer 0.03 3.83 3.80 0.03
TV 0.00 3.89 3.74 0.15
Family / Kid-Friendly 0.00 3.87 3.79 0.08
Dryer 0.00 3.92 3.77 0.15
Kitchen 0.52 3.82 3.83 -0.01
Elevator in Building 0.00 3.89 3.79 0.10
Lock on Bedroom Door 0.00 3.79 3.83 -0.04
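A comparison of means of this kind can be sketched with a two-sample t-test. The snippet below computes Welch's t statistic, which does not assume equal variances; whether the p-values in Table 5 come from this variant is an assumption on our part, and the ln(price) samples are hypothetical.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    t = diff / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical ln(price) values for rooms with and without a given amenity.
with_amenity = [3.9, 4.0, 3.8, 4.1, 3.7]
without_amenity = [3.7, 3.8, 3.6, 3.9, 3.5]

t_stat, dof = welch_t(with_amenity, without_amenity)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The p-value then follows from the t distribution with the computed degrees of freedom; with several thousand observations, even small differences in mean ln(price) can be significant.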

2.6 Attributes of the ad

Usually, a guest needs to submit a booking request and gets to stay in the property only if the host approves that request. To attract more customers, some hosts allow instant booking of their properties, which is similar to booking a hotel - the user just books the property straight away. In the dataset, TRUE means guests can book the desired property instantly, while FALSE means they have to get approval from the host first.

In addition to instant booking, hosts also have the right to choose their own cancellation policy. The cancellation policy determines whether guests can get a refund and how they are refunded. There are several cancellation policies from which hosts can choose, including flexible, moderate, strict and super strict. Under the flexible policy, guests may get a full refund if the reservation is cancelled within a limited period, typically 24 hours prior to check-in. Under the moderate policy, fees are fully refundable, but only if the booking is cancelled a longer time in advance. Under the strict policy, only 50% of fees may be refunded if the booking is cancelled more than one week before check-in (Airbnb Inc., 2017). While the difference in mean price between rooms with and without instant booking is insignificant, the scaled version of the cancellation policy is significantly but weakly correlated with the price of the room.

Table 6: Descriptives of Ad Properties
Attribute P-Value
Instant Bookable 8.435e-01
Cancellation Policy 1.513e-08

3 Regression model

Table 7: Regression Results
Dependent variable:
Ln Price
(1) (2)
Mean Rent 0.001*** (0.0001) 0.001*** (0.0001)
Distance -0.013*** (0.001) -0.013*** (0.001)
Review Score - Rating 0.007*** (0.001)
Review Score - Accuracy -0.016** (0.008)
Review Score - Check-In 0.014* (0.008)
Review Score - Cleanliness 0.040*** (0.006) 0.037*** (0.004)
Review Score - Communication 0.001 (0.009)
Review Score - Location 0.080*** (0.006) 0.072*** (0.006)
Review Score - Value -0.083*** (0.008)
Accommodates 0.160*** (0.006) 0.151*** (0.005)
Number of Beds -0.023** (0.010)
Amenity - Dryer 0.072*** (0.008) 0.062*** (0.008)
Amenity - Elevator 0.045*** (0.008) 0.043*** (0.008)
Amenity - Family friendly 0.007 (0.008)
Amenity - Lock on Bedroom Door -0.045*** (0.010) -0.049*** (0.010)
Amenity - TV 0.116*** (0.008) 0.116*** (0.008)
Amenity - Washer -0.044*** (0.010)
Instant bookable - FALSE 0.004 (0.010)
Cancellation Policy - Moderate -0.008 (0.009)
Constant 2.111*** (0.070) 2.003*** (0.054)
Observations 7,020 7,020
R2 0.433 0.420
Adjusted R2 0.432 0.420
Residual Std. Error 0.304 (df = 7000) 0.307 (df = 7010)
F Statistic 281.780*** (df = 19; 7000) 565.119*** (df = 9; 7010)
Note: *p<0.1; **p<0.05; ***p<0.01

3.1 Interpretation

As our dependent variable was transformed to its logarithmic version, a log-linear regression model is used to explain the effect of the independent variables on the dependent variable. Comparing the two versions of the model, it becomes clear that some variables are insignificant, some have a multicollinearity problem and the review score for value is likely to have an endogeneity problem. Additionally, some of the amenities had a negative impact on the price, which is counterintuitive and contradicts the results of the t-test. As these effects are small and likely to be caused by random noise, such variables are excluded:

\[\begin{gather*} \ln(price) = \beta_0 + \beta_1(mean\_rent) + \beta_2(distance) + \beta_3(accommodates) + \\ \beta_4(review\_scores\_rating) + \beta_5(review\_scores\_cleanliness) + \beta_6(review\_scores\_location) + \\ \beta_7(TV) + \beta_8(elevator) + \beta_9(dryer) + u \end{gather*}\]

The presented regression model explains 42 percent of the variation in the dependent variable. The residual standard error of 0.307 on the log scale corresponds to a multiplicative factor of approximately exp(0.307) ≈ 1.36, meaning that a typical prediction deviates from the real price by a factor of about 1.36 rather than by a fixed amount in GBP. The F-statistic is highly significant; thus, the model provides a far better explanation than an intercept-only model. The intercept of 2.003 corresponds to a price of exp(2.003) ≈ 7.41 GBP. However, there is no apartment that has a rent of zero or accommodates no one, so the intercept has to be interpreted carefully. The other coefficients indicate approximately by what percentage the price changes if the explanatory variable changes by one unit, holding all other independent variables constant. For example, for every additional person a room can accommodate, the price rises by roughly 15 percent. Capacity, the review score for location as an indicator of the attractiveness of the neighbourhood, and amenities such as a TV, an elevator and a dryer have the largest positive effects on the price of a room. The review score for value had to be taken out of the regression because of its inherent endogeneity problem: price is a large factor in how a guest rates value for money. The effect of distance is surprisingly small. This implies that distance to the city centre is not the best measure to account for geographic differences.
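The interpretation of log-linear coefficients can be made concrete with a short calculation. The sketch below uses coefficients of model (2) as reported in Table 7; the dictionary keys and the function name `percent_change` are our own labels for illustration.

```python
import math


def percent_change(beta, delta=1.0):
    """Exact percentage change in price when a predictor rises by `delta` units."""
    return (math.exp(beta * delta) - 1.0) * 100.0


# Coefficients of model (2) as reported in Table 7.
coefficients = {
    "accommodates": 0.151,
    "tv": 0.116,
    "review_location": 0.072,
    "distance": -0.013,
}
intercept = 2.003
residual_std_error = 0.307

effects = {name: percent_change(beta) for name, beta in coefficients.items()}
baseline_price = math.exp(intercept)         # implied price when all predictors are zero
error_factor = math.exp(residual_std_error)  # multiplicative size of one residual standard error

for name, effect in effects.items():
    print(f"{name}: {effect:+.1f}% per unit")
print(f"baseline price: {baseline_price:.2f} GBP, error factor: {error_factor:.2f}")
```

For small coefficients the exact percentage change is close to 100 times the coefficient, which is why reading a coefficient of 0.151 as roughly 15 percent is a reasonable approximation; the exact value is slightly higher.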

3.2 Fitting the model

Table 8: Residuals
Name Ln Mean Rent Mean Rent
Data with high residuals 4.37 85.05
Data with low residuals 3.73 44.14

A short exploration of the residuals shows that the residuals increase with rising prices. This implies that our model is worse at predicting the more expensive rooms, as the factors chosen do not fully explain the difference in price. The relation between residuals and prices may be explained by a factor the model could not quantify: the attractiveness of the room and the house it is in. As this attractiveness differs across buildings and sometimes even within a building, it is impossible to predict the price of an apartment that exceeds the expectations set by the base explanatory variables used in the regression.

Figure 5: Residuals and Prediction


3.3 Limitations of our model

Due to the scope of this assignment, we were not able to address every issue with our data. Several of these issues are discussed here.

3.3.1 Omitted variable bias

The price of an AirBnB is affected by a large number of factors. We built a model that includes some of them, but it was not feasible to include data on every possible determinant. As a result, our model likely suffers from omitted variable bias: it under- or overestimates the effect of some of the included factors to compensate for the missing information, which makes it less reliable. Our dataset does not contain several important variables, such as the size of the room, the proximity of the flat to a tube station, the age of the flat, the quality of the equipment and furniture in the flat, or the attractiveness of the apartment and the building.

3.3.2 Multicollinearity

Testing for multicollinearity reveals correlations between the explanatory variables. Such correlations can inflate the standard errors of the coefficient estimates. A VIF of four implies that the variance of an estimator is four times higher than it would be if the independent variables were uncorrelated. Usually, a VIF greater than 3 is considered critical to the model results. None of the variables used reaches that threshold. The Durbin-Watson statistic of 1.88 is close to 2 and the estimated autocorrelation of 0.06 is small, indicating that the error terms are at most weakly correlated, as is also visible in the plot of residuals against predicted values.

Table 9: Results Multicollinearity Test
VIF
Mean Rent 1.76
Distance 1.68
Review Score - Cleanliness 1.28
Review Score - Location 1.37
Accomodates 1.03
Amenity - Dryer 1.05
Amenity - Elevator 1.02
Amenity - Lock on Bedroom 1.03
Amenity - TV 1.07

Table 10: Results Durbin Watson Test
Autocorrelation D-W Statistic p-value
0.06 1.88 0
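Both diagnostics follow from simple formulas, sketched below in Python. The function names and the residual values are hypothetical; only the autocorrelation of 0.06 is taken from Table 10.

```python
def vif_from_r_squared(r_squared):
    """VIF of a predictor, given the R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r_squared)


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, for illustration only.
residuals = [0.2, -0.1, 0.3, -0.2, 0.1, -0.3]
dw = durbin_watson(residuals)

# Large-sample approximation DW = 2 * (1 - rho), with rho = 0.06 from Table 10.
approx_dw = 2 * (1 - 0.06)

print(f"VIF for R^2 = 0.75: {vif_from_r_squared(0.75):.1f}")
print(f"DW of example residuals: {dw:.2f}")
print(f"approximate DW for rho = 0.06: {approx_dw:.2f}")
```

The approximation reproduces the reported statistic of 1.88, and values near 2 indicate little first-order autocorrelation in the residuals.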

3.3.3 Sensitivity to outliers

As in any regression model based on ordinary least squares, the coefficients in our model are affected by outliers. Some of the properties in our dataset cost more than £400 per night, while most of them cost below £100. The outliers may have disproportionately affected our coefficients, making them less accurate for the remaining observations.

3.3.4 Non-linear relationships

Figure 6: Non-Linear Relationship between Price and Location


Some of our explanatory variables (for example, the distance from the city centre) are not linearly related to the price of the property. There is a significant difference between the average price of a room located right in the city centre and one located 5km away, while the difference between rooms located 25km and 30km away is not very large. This suggests that we could model the relationship more accurately if we used non-linear regression.

3.3.5 Lack of clustering

By putting all properties into one model, we ignore the fact that there might be different profiles of properties and for each profile, different characteristics might be relatively more important. Perhaps there is a set of properties that are popular with students coming to London for graduate job interviews, who would see location close to the financial centers and low price as important factors. And, perhaps, different types of properties are popular with middle-aged tourists - then the proximity to the popular sights and the level of comfort provided might matter more. If we divided our properties into clusters which share similar characteristics, and then ran a regression analysis for each cluster, we might get a more accurate model for each cluster.

4 Conclusion

Although the presented model has obvious limitations regarding factors that could not be quantified, it has direct implications for finding a reasonably priced apartment. Many attributes that are important to someone searching for a room, like WiFi and a properly equipped kitchen, have small effects on the room price, as they are present in most London apartments. A traveller can therefore expect those attributes to be present. Luxury amenities like an elevator, a TV and a dryer raise the price. Depending on the standards of the guest, these can be added if the budget is extended. It is also good advice to check apartments in less attractive neighbourhoods to save money. Regarding cleanliness, a well-maintained room will cost more. By looking at these different attributes of an AirBnB ad, the user is able to determine whether the price of the apartment is actually fair, which was the aim of this report. As especially high prices could not be explained by the model, a prediction is likely to return a base price rather than the price of a highly attractive room in a good apartment in a nice building.

Bibliography

Airbnb Inc. (2017) How do star ratings work. [Online]. Available from: https://de.airbnb.com/help/article/1257/how-do-star-ratings-work.

Cox, M. (2017) Inside airbnb - adding data to the debate. [Online]. Available from: http://data.insideairbnb.com/united-kingdom/england/london/2017-03-04/data/listings.csv.gz.

Latlong.net (2017) Get latitude and longitude. [Online]. Available from: https://www.latlong.net.

Lokku Ltd. (2017) London house prices by postcode. [Online]. Available from: https://www.findproperly.co.uk/london/postcode/#.WdvonHeZNn4.

Reid, M. (2011) Haversine formula. [Online]. Available from: http://wordpress.mrreid.org/2011/12/20/haversine-formula/.

Furthermore, for plotting our observations on a ggmap, we consulted the following sources:

Irawan, D.E. (2014) How to convert lat-long coordinates to utm. [Online]. Available from: https://rpubs.com/dasaptaerwin/19879.

Lovelace, R. & Cheshire, J. (2014) Introduction to visualising spatial data in R. National Centre for Research Methods Working Papers. [Online] 14 (03). Available from: https://github.com/Robinlovelace/Creating-maps-in-R.

The header photo was downloaded from Pexels and is licence free. Available from: https://www.pexels.com/photo/architecture-buildings-business-capital-417382/

Imperial College Business School